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Abstract 

Computational RNA secondary structure prediction is 
rather well established. However, such prediction algo- 
rithms always depend on a large number of experimen- 
tally measured parameters. Here, we study how sensi- 
tive structure prediction algorithms are to changes in 
these parameters. We find that already for changes corre- 
sponding to the actual experimental error to which these 
parameters have been determined 30% of the structure 
are falsly predicted and the ground state structure is pre- 
served under parameter perturbation in only 5% of all 
cases. We establish that base pairing probabilities cal- 
culated in a thermal ensemble are a viable though not 
perfect measure for the reliability of the prediction of in- 
dividual structure elements. A new measure of stability 
using parameter perturbation is proposed, and its limi- 
tations discussed. 



Introduction 

In an endeavor to understand the workings of any or- 
ganism, one cannot ignore the importance of ribonucleic 
acids (RNA) Q. RNA molecules transmit genetic infor- 
mation through the cell. They are also intimately in- 
volved in many important biological processes such as 
translation, regulation, and splicing. In addition to its 
importance for today's organisms RNA is also an inter- 
esting molecule to study due to its likely role as a major 
player during the origin of life @] • 

The function of a given RNA is determined by its phys- 
ical structure. This structure is encoded in the sequence 
of the four nucleotides (or bases) A, U, G, and C from 
which each RNA molecule is composed. Determining 
that structure in the laboratory is a laborious, and of- 
ten unsuccessful, undertaking. Thus, it has become an 
interdisciplinary task to determine these structures from 
the sequences alone. 

The encoding of a structure in the sequence is real- 
ized by specific interactions between the bases. The by 
far most important of these interactions is the forma- 
tion of A-U and G-C base pairs, also known as Watson- 
Crick pairs. With the formation of each base pair, the 
Gibbs free energy of the structure is lowered, and thus, 
the structure's stability is increased. Since the sequence 
of bases that defines the RNA is finite, the number of 
possible structures into which a given RNA can fold is 
also finite. The most thcrmodynamically likely structure 
to be formed is the structure with the lowest free energy 
known as the minimum free energy structure. 

Although the number of possible structures for a given 
sequence is enormous, computer algorithms, such as the 
Vienna Package Q or MFOLD Q , can find the minimum 



free energy structure or the full partition function of the 
ensemble of all structures given a sequence in a time that 
is proportional to the third power of the sequence length 
due to a recursive relationship d, IS 0- The problem 
with these algorithms lies in the calculation of the free 
energies of the structures. The contribution to the free 
energy attributed to a base pairing or the formation of 
various substructures such as various kinds of loops is 
measured experimentally and used as parameters in the 
RNA folding algorithm. For various reasons, these pa- 
rameters contain errors. On the one hand, several effects 
such as steric interactions between different regions of 
the structure, pseudo-knots, base triplets, or even inter- 
actions with proteins or other RNA molecules are not 
reflected at all in the underlying free energy model. All 
these effects result in systematic errors in the free en- 
ergy parameters. On the other hand, there are ordi- 
nary, non-systematic errors of measurement associated 
with these parameters as well. Thus, while the algo- 
rithm guarantees to find the minimum energy structure 
within the energy model provided by the experimentally 
determined parameters, this structure does not have to 
be the true (and thus biologically realized) minimum free 
energy structure. 

The goal of this paper is not to discuss the causes of 
these experimental errors but simply, to investigate the 
consequences these errors have regarding structure pre- 
diction. Our approach is to randomly modify the mea- 
sured free energy parameters within a range compara- 
ble to the experimental errors and to record how much 
the predicted minimum free energy structures change. 
We find that already around 30% of the structures are 
changed if the free energy parameters are varied within 
the experimental error. While this is a rather sobering 
result, we at least find that base pairing probabilities 
evaluated in a thermal ensemble are very good priors to 
estimate which parts of the structure prediction are reli- 
able and which are not. 



Materials and Methods 

In this section we will discuss how RNA structure pre- 
diction algorithms work and how we model experimental 
error in the free energy parameters. This will provide the 
necessary background for our study. 

a. RNA secondary structure: The strongest inter- 
action between the bases of an RNA molecule is the 
formation of Watson-Crick base pairs. Therefore, one 
distinguishes two levels of structure, namely secondary 
and tertiary structure (with the primary structure just 
being another name for the sequence of the molecule.) 
A secondary structure is defined as the collection of all 
base pairs that have formed without regard to any spa- 



2 



tial organization of the backbone. The tertiary struc- 
ture then includes the actual spatial organization and el- 
ements formed by less stable interactions than base pair- 
ing such as base triplets, backbone contacts mediated by 
divalent ions, etc. Since base pairing is energetically more 
important than the other interactions, it is meaningful to 
talk about the secondary structure of an RNA molecule 
without considering the tertiary structure Q . This is the 
point of view of the algorithms that we will study here; 
therefore we also will discuss secondary structure only for 
the remainder of this paper. 

In order to make secondary structure prediction com- 
putationally feasible, it is necessary to exclude so-called 
pseudoknots from the allowed secondary structures. Such 
a pseudoknot exists if bases i and j form a base pair and 
bases k and I form a base pair and these two base pairs 
are nested as i < k < j < I or k < i < I < j. While these 
pseudoknots do appear in biological structures, they are 
bound to be short due to kinetic constraints. Thus, they 
can be omitted in secondary structure prediction ||| and 
be considered part of the tertiary structure of a molecule. 

b. Energy model: A secondary structure as defined 
above can be drawn as shown in Fig. ^ It can be de- 
composed into a large number of loops that arc cither 
stacking loops, bulges, interior loops, hairpins, or multi- 
loops as also indicated in the figure. The main assump- 
tion of the generally accepted free energy model is that 
the total free energy of a secondary structure is the sum 
of independent contributions from all of its loops. These 
loop contributions depend on the identity of the bases 
in a loop and on the length of the loop. E.g., the free 
energy of a stacking loop depends on the identity of all 
four bases that form the two base pairs a stacking loop is 
formed by leading to (after taking into account symme- 
try) 21 different free energy parameters for stacking loops 
(if in addition to G-C, and A-U also the wobble base pair 
G-U is allowed). For short bulges and interior loops the 
number of parameters increases by a factor of four for ev- 
ery unpaired base in the loop. Longer loops arc typically 
only characterized by their length and by the identity of 
the unpaired bases immediately next to the base pairs 
defining the loop in order to avoid an explosion of pa- 
rameters. Nevertheless, a complete free energy model 
is described by on the order of a thousand parameters 
that are determined experimentally [ij. Since all these 
parameters arc true free energies, i.e., differences of ener- 
getic contributions such as chemical binding energy and 
bending energy and entropic contributions from the inte- 
grated out spatial degrees of freedom of the backbone and 
the surrounding water, each parameter depends on the 
temperature. In our study we will keep the temperature 
constant at physiological 37°C. 

c. Perturbations of the energy model: In order to 
study the sensitivity of structure prediction to thermo- 
dynamic parameters, one must perturb the parameters. 
For simplicity we assume that the error in the param- 
eters is roughly Gaussian distributed. We model these 
errors by adding to every single free energy parameter a 
Gaussian random variable with mean zero and standard 
deviation, e, i.e., with a probability density 
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FIG. 1: Schematic of an RNA secondary structure. The solid 
line represents the backbone of the molecule while the dashed 
lines symbolize base pairs. Each such structure can be de- 
composed into stacking loops (s), bulges (b), interior loops 
(i), hairpin loops (h), and multi-loops (m) as indicated. 

In doing so we take great care to preserve the inherent 
symmetry in the parameters (e.g. the energy of a stack- 
ing loop made from a GC-pair and an AU-pair (-GA- 
paired with -UC-) being equal to the energy of a stack- 
ing loop made from an UA-pair and a CG-pair, i.e., -UC— 
paired with -GA-) . The parameter e serves as a measure 
of the magnitude of the experimental error inherent in the 
parameters. 

We will explore a whole range of different values for the 
parameter e to understand the sensitivity of predicted 
structures to perturbations of the free energy parame- 
ters. In order to get an idea what experimental errors 
on the free energy parameters are in reality, it is most 
illustrative to look at the stacking energies since stacking 
energies have been measured in many different labora- 
tories. In the case of the more studied DNA stacking 
energies seven independent measurements have been sys- 
tematically compared . This study reports in addition 
the detailed free energy parameters the stacking energies 
averaged over the different types of stacking for each of 
the seven independent experiments. If we take the aver- 
age of these averages we find it to be — 1.4±0.3kcal/mol. 
Since the experimental procedure for the determination 
of RNA free energy parameters is the same as for DNA 
free energy parameters we conclude that a good estimate 
for the experimental error is 0.3kcal/mol. Another in- 
dication that this is the order of magnitude of the ex- 
perimental error of the stacking free energies is that the 
additive free energy model itself is experimentally known 
to break down at this level of precision jl ll| . This implies 
that this uncertainty is not due to a lack of experimen- 
tal techniques of higher precision (which could in prin- 
ciple be overcome by new experimental developments) 
but that these uncertainties are unavoidable on principle 
grounds. Since we do not have very good estimates of 
the experimental error of the other free energy param- 
eters we uniformly apply the same error estimate to all 
free energy parameters. 
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Results 

If the predicted minimum free energy (mfe) structure, 
also referred to as the ground state, has a far lower free 
energy than any alternative structure, the ground state 
is said to be thermodynamically stable. For the purposes 
of this paper, we are interested in another type of sta- 
bility. We will call a structure unstable with respect to 
parameter perturbation should the predicted mfe struc- 
ture require a strict adherence to one or more thermody- 
namic parameters in order to remain the predicted mfe 
structure. We will quantify this stability in the following 
in two different ways. 

d. Distance of structures: The first way we study 
this instability is by looking at what fraction of a struc- 
ture is still predicted correctly once parameters are per- 
turbed. To this end, we need some quantitative method 
of comparison for structures. In this study, we will use 
the normalized tree distance Q to quantify the amount 
by which mfe structures at different free energy param- 
eter choices differ. We convinced ourselves that other 
solely structure based measures such as the string dis- 
tance Q lead to the similar results as the ones presented 
here. The tree distance is based on a metric that views 
a secondary structure as being defined by a tree diagram 
where the leafs of the tree are the bases and the topology 
of the tree represents the structure in an intuitive way. 
The tree distance is then defined as the number of ele- 
mentary operations on the tree such as cutting a branch 
and attaching it at a different place in the tree. For a 
more detailed description of this difference measure the 
reader is referred to Ref. Since the tree distance is 
a number between zero (for identical structures) and the 
length of the sequence, we rescale the tree distance by the 
length of the sequence. This scaling allows us to compare 
the stability of sequences of different lengths and permits 
a more intuitive interpretation of the data. For example, 
a scaled distance of 0.2 stands for a 20% difference in 
structure. 

For our study we choose a series of natural sequences 
(namely group I introns) with length varying between 
227 and 685, that is, RNA sequences which have been 
observed in biological systems. For each of these se- 
quences the mfe structure is predicted by the computer 
algorithm from the Vienna package Q using the accepted 
experimentally measured parameters In addition, we 
determine the mfe structures of the same sequences for 
a hundred sets of randomly perturbed energy parame- 
ters for each uncertainty e. We calculate the scaled tree 
distance between each mfe structure calculated with per- 
turbed parameters and the corresponding mfe structure 
obtained with unperturbed parameters (i.e. the exper- 
imentally measured values without alteration) and av- 
erage these distances over the 100 realizations for each 
sequence and each e. 

Fig. El shows these averaged distances as a function of 
the perturbation parameter e. One should note the over- 
all instability of the structure prediction of these natural 
sequences. As can be seen in Fig. [21 there is already a 
significant deviation of about 30% from the ground state 
structure at e sa 0.3kcal/mol which roughly corresponds 
to the actual experimental error of the parameters 0] . 



For e ps lkcal/mol already half of the structure can no 
longer be predicted. 
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FIG. 2: The average scaled tree distances of various natural 
sequences for different perturbations of the parameters e. The 
error bars denote the statistical error after averaging over 100 
different realizations of perturbed free energy parameter sets. 
The solid line shows the average over all sequences. It be- 
comes obvious that already at the experimental uncertainty 
of e si 0.3kcal/mol about 30% of the structure is predicted 
unreliably. 

e. Ground state probability: An alternative, and as 
it turns out even more sensitive, measure of stability is 
the probability that the ground state structure will be 
the predicted structure at a given perturbation e. To 
estimate this quantity we determine the mfe structure 
of our sequences for 1000 sets of randomly perturbed pa- 
rameters for each perturbation strength e and classify the 
parameter sets according to the mfe structures. Since we 
catalog the parameter sets which produce each structure, 
the frequency at which the "correct" (i.e., calculated with 
the accepted free energy parameters) structure occurs 
can be determined, as well as the number of alternative 
structures possible at a given e. 

As can be seen in table 1, the ground state structure 
is most likely not the true structure for one sequence for 
e as small as 0.02, and for all sequences for e of 0.22. 
At the experimental error rate, i.e., e f» 0.32, only a 
mere 5% of the parameter perturbations reproduce the 
same ground state. For this part of the study we also in- 
clude data on randomly generated sequences of varying 
length in addition to data on the group I introns used 
in the rest of this manuscript. Table 1 shows that the 
ground state stability for these sequences is even worse 
than for the group I introns. From this data, one can 
see that the current error in the thermodynamic param- 
eters casts serious doubt upon the structural predictions 
made by folding algorithms. Since this error is at least 
to a good part due to fundamental limitations in the en- 
ergy model and thus cannot be significantly reduced just 
by better measurements, predictions from folding algo- 
rithms should never be taken at face value but always be 
subjected to critical crosschecking. 

/. Reliability estimation: Given that we obviously 
have to live with the fact that on the order of 30% of 
a secondary structure will be incorrectly predicted just 
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Accession 


Length 


e = 


: 0.02^ 


e = 


°- 12 ^fr 


e 


°- 22 S 


e = 


0.32^fi 


= 0.42^4 

mol 


e = 0.52^ 

mol 


Random 


190 





100.0% 


147 


10.5% 


31 1 


7.4% 


576 


7.3% 


829 


1.1% 


983 


0.5% 


Random 


210 


1 


95.7% 


291 


2.3% 


541 


2.2% 


846 


1.0% 


977 


0.6% 


989 


0.1% 


AJ228695 


227 





100.0% 


20 


46.8% 


128 


15.8% 


350 


5.9% 


619 


2.1% 


829 


1.1% 


Random 


230 


15 


34.0% 


195 


5.5% 


445 


3.5% 


855 


1.6% 


994 


1% 


991 


2% 


AJ228705 


248 





100.0% 


38 


58.1% 


223 


18.4% 


574 


6.4% 


873 


1.8% 


965 


0.6% 


U83261 


243 


4 


98.0% 


49 


23.1% 


241 


8.8% 


591 


3.4% 


880 


1.0% 


976 


0.2% 


Y13474 


256 





100.0% 


23 


86.1% 


167 


32.7% 


491 


6.0% 


794 


1 2% 


940 


n 3% 


Random 


270 


4 


76.2% 


353 


3.4% 


681 


1.3% 


926 


0.2% 


993 


0.2% 


999 


0.1% 


Random 


290 


1 


99.9% 


134 


23.1% 


513 


1.7% 


908 


0.2% 


996 


0.1% 


995 


0.1% 


Random 


330 


9 


58.8% 


327 


7.1% 


909 


1.1% 


986 


0.2% 


997 


0.1% 


999 


0.1% 


Random 


350 


2 


79.5% 


243 


8.1% 


581 


2.2% 


990 


0.1% 


998 


0.1% 


999 


0.1% 


M38691 


376 


6 


61.5% 


516 


2.6% 


941 


0.2% 


996 


0.1% 


1000 


0.1% 


1000 


0.1% 


Random 


390 


6 


27.1% 


321 


11.8% 


846 


0.5% 


997 


0.1% 


999 


0.1% 


1000 


0.1% 


V01416 


426 


7 


40.4% 


219 


9.4% 


749 


1.7% 


977 


0.4% 


999 


0.2% 


1000 


0.1% 



TABLE I: Frequency at which the ground state is predicted (right) and the number of alternate structures predicted (left) as 
a function of the parameter perturbation e. Entries labeled "Random" are randomly generated squences. All others are group 
I introns. 



because of the uncertainties in the free energy parame- 
ters, the question comes up if it is at least possible to 
find out which parts of the structure are the reliable ones 
and which are the unreliable ones. If the RNA secondary 
structure prediction algorithm is used to calculate the full 
partition function for a given sequence instead of the mfe 
structure only, it can assign to every pair of bases 
a probability p.ij that these two bases are paired within 
a thermal ensemble @. Since these probabilities can be 
calculated in the same time as the mfe structure, these 
probabilities arc a convenient measure for the reliability 
of the prediction of an individual base pair. However, it 
is not a priori clear if a high probability in the thermal 
ensemble corresponds to stability with respect to uncer- 
tainties in the free energy parameters. 

To study if the thermal probabilities have any meaning 
for the stability of a base pair with respect to parameter 
changes, we compare the thermal ensemble directly to 
an ensemble of mfe structures calculated with perturbed 
parameters. To this end we calculate the mfe structure 
for a given sequence for 1000 different sets of perturbed 
free energy parameters. We catalog the resulting base 
pairs, and determine the frequency with which they oc- 
cur. Then, we compare this frequency to the thermal 
ensemble probability pij calculated at the accepted pa- 
rameters for each individual base pair. 

The comparison of the base pair frequencies versus the 
thermal ensemble probabilities is shown in Fig. [3] for a 
representative sequence (we convinced ourselves that the 
results are qualitatively similar for all sequences). We 
observe that at small e, since few or no alternative struc- 
tures are predicted, the plot appears to be very much 
a step function; base pairs which the thermal ensem- 
ble predicts to have a significant probability ( ss 40% 
or more) occur while less likely base pairs do not. As e 
is increased, more alternative structures begin to appear, 
and one can see the edges of the step begin to smooth. 
By e = 0.32kcal/mol a strong correlation is apparent 
even though there is a clear spread. If increased beyond 
e = 0.32kcal/mol, the correlation between the base pair 



frequency and the thermal ensemble probability differs 
in no significant qualitative way. From these correlations 
we deduce that as the level of parameter perturbations 
reaches the value of e = 0.32kcal/mol the pool of alter- 
native secondary structures minimizing perturbed energy 
parameters and the pool of suboptimal structures probed 
in the thermal ensemble become similar. This is in a 
way surprising since the thermal energy itself is about 
0.6kcal/mol and thus much smaller than the perturba- 
tion of the total free energy of a structure obtained by 
perturbing every free energy parameter by 0.32kcal/mol. 
It might imply that the base pair probability of an in- 
dividual base pair is only sensitive to perturbations of 
a few key free energy parameters that delineate different 
low free energy structures from each other. Whatever the 
reason for the observed correlation, we can conclude that 
the easily calculable thermal probabilities are a good esti- 
mate for the sensitivity of a given base pair to parameter 
perturbations. 



Discussion 

We studied the sensitivity of RNA secondary structure 
prediction to perturbations of the free energy parameters. 
The main result is that if the free energy parameters are 
perturbed within a range that is supposed to be the ex- 
perimental uncertainty with which these parameters have 
been determined, about 30% of the structure turns out 
to be unreliable and the chance of predicting the same 
ground state as with the unperturbed parameters is only 
5% even for moderately sized sequences with lengths up 
to 426. Given this imprecision, we found that at least 
base pairing probabilities calculated in a thermal ensem- 
ble are reasonably well correlated with the probabilities 
that a base pair will be unaffected by uncertainties in 
the free energy parameters. These results support the 
commonly employed method of using thermal ensemble 
probabilities to sort out which parts of the structure can 
be trusted and which cannot. However, although cal- 
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FIG. 3: Correlation between the probability to find a base pair in an mfe structure with perturbed parameters and the 
probability of the base pair in the thermal ensemble for the sequence with accession number Y13474. Each point represents a 
base pair and its position represents its respective probabilities of forming. The different plots show how the two probabilities 
become more correlated as the strength e of the parameter perturbations is increased. 



culation of the thermal ensemble [l3L[T3 | is expedient, it 
offers only knowledge of how base pairings will behave on 
a individual basis, and not how they will behave in con- 
cert. The ground state probability method gives one not 
only a probabilistic measure of the accuracy of the predic- 
tion, but also all the probable alternative structures and 
some gauge of their likelihood of being the true structure. 
Should one have the computer power, one should always 
check the ground state and individual base pair probabil- 
ities using the method outlined in this paper. Doing so 
is imperative if the thermal ensemble suggests a dubious 
structure prediction. The amount of time sacrificed for 
the additional information is dependent upon how accu- 
rate one wishes the probabilities to be. With advances 
in computer chip technology, the extra factor of 100 to 
1000 in computation time involved in using the ground 
state probability as opposed to using the thermal ensem- 
ble may soon become a more practical investment in cases 
where it is of big importance to know in addition to the 
predicted structure which parts of the structure are likely 
to be correctly predicted and which should be discarded 
as simple artifacts of the imprecisions of the free energy 
model. 



Acknowledgments 

This research was supported by the Research Experi- 
ence for Undergraduates programs of the National Sci- 
ence Foundation through grant number PHY-0242665. 



[1] Couzin,J. (2002) Small RNAs make big splash. Science 
298, 2296. 

[2] Gilbert.W. (1986) Origin of life-the RNA world. Nature 
319, 618 . 

[3] Hofacker,I.L., W. Fontana,W., Stadler,P.F., Bonhoef- 
fer L.S., Tacker,M. and Schuster,P. (1994) Fast folding 
and comparison of RNA secondary structure. Monatsh. 
Chem. 125, 167. 

[4] Zuker.M. and Jacobson,A.B. (1995) Well-determined re- 
gions in RNA secondary structure prediction-analysis of 
small-subunit ribosomal RNA. Nucl. Acids Res. 23, 2791. 

[5] de Gennes,P.G. (1968) Statistics of branching and hair- 
pin helixes for dAT copolymer. Biopolymers 6, 715. 

[6] Waterman, M.S. (1978) Secondary structure of single- 
stranded nucleic acids. Adv. Math. Suppl. Studies 1, 167. 

[7] McCaskill,J.S. (1990) The equilibrium partition-function 
and base pair binding probabilities for RNA secondary 
structures. Biopolymers, 29 1105. 

[8] Tinoco,I. Jr. and Bustamante,C. (1999) How RNA folds. 
J. Mol. Biol. 293, 271. 

[9] Mathews,D.H., Sabina,J., Zuker,M. and Turner,D.H. 



(1999) Expanded sequence dependence of thermody- 
namic parameters improves prediction of RNA secondary 
structure. J. Mol. Biol. 288, 911. 

[10] SantaLucia,J. (1998) A unified view of polymer, dumb- 
bell, and oligonucleotide DNA nearest-neighbor thermo- 
dynamics. Proc. Natl. Acad. Sci. USA 95, 1460. 

[11] Kierzek,R., Caruthers,M.H., Longfellow, C.E., Swin- 
ton,D., Turner,D.H. and Freier,S.M. (1986) Polymer- 
supported RNA-synthesis and its application to test the 
nearest-neighbor model for duplex stability. Biochemistry 
25, 7840. 

[12] Fontana,W., Konings,D.A., Stadler,P.F. and Schuster,P. 
(1993) Statistics of RNA secondary structures. Biopoly- 
mers 33, 1389. 

[13] Zuker,M. and Jacobson,A.B. (1998) Using reliability in- 
formation to annotate RNA secondary structures. RNA 
4, 669. 

[14] Mathews, D.H. (2004) Using an RNA secondary structure 
partition function to determine confidence in base pairs 
predicted by free energy minimization. RNA 10, 1178. 



